Global Detection of Complex Copying Relationships Between Sources
Abstract
Web technologies have enabled data sharing between sources but have also simplified copying (and often publishing without proper attribution). The copying relationships can be complex: some sources copy from multiple sources on different subsets of data, some co-copy from the same source, and some transitively copy from another. Understanding such copying relationships is desirable both for business purposes and for improving many key components in data integration, such as resolving conflicts across various sources, reconciling distinct references to the same real-world entity, and efficiently answering queries over multiple sources. Recent work has studied how to detect copying between a pair of sources, but the techniques can fall short in the presence of complex copying relationships. In this paper, we describe techniques that discover global copying relationships between a set of structured sources. Towards this goal, we make two contributions. First, we propose a global detection algorithm that identifies co-copying and transitive copying, returning only source pairs with direct copying. Second, because global detection requires accurate decisions on copying direction, we significantly improve on previous techniques for this task by considering various types of evidence for copying and the correlation of copying on different data items. Experimental results on real-world and synthetic data show that our techniques are both effective and efficient.
Journal: PVLDB
Volume: 3
Publication date: 2010